Data Wrangling Project 4 Udacity

Project description:

In this project we will apply all the steps of data wrangling to analyze data from the Twitter account "WeRateDogs".

Read format 2 for image:

Read format 1:

Read format 2:

Creating a dataframe from tweet_json file:
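The tweet_json file holds one JSON object per line, so it can be read line by line, parsing each line and keeping only the fields needed. A minimal sketch, using a hypothetical two-line sample in place of the real file:

```python
import json

import pandas as pd

# Hypothetical sample: in the real project each line of the tweet_json
# file is one JSON object returned by the Twitter API.
sample_lines = [
    '{"id": 892420643555336193, "retweet_count": 100, "favorite_count": 400}',
    '{"id": 892177421306343426, "retweet_count": 200, "favorite_count": 500}',
]

rows = []
for line in sample_lines:
    tweet = json.loads(line)  # parse one tweet per line
    rows.append({"tweet_id": tweet["id"],
                 "retweet_count": tweet["retweet_count"],
                 "favorite_count": tweet["favorite_count"]})

df_json = pd.DataFrame(rows)
```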

Read format 3

Assess:

In this assessment we will first perform a visual assessment in Excel and then a programmatic assessment using pandas methods and functions to identify quality and tidiness problems in the gathered data.

Visual assessment in Excel:

Programmatic assessment:

Creating a copy of the datasets to make modifications

Assessing 9 quality and 3 tidiness issues

After the assessment of the 3 dataframes, the following problems were found:

Quality:

1) "timestamp" has object as its datatype; it should be datetime #OK
2) "tweet_id" in the 3 dataframes has an int64 datatype, but it should be a string or object datatype. #OK
3) "jpg_url" does not have an image for every tweet_id when the 3 dataframes are merged into 1
4) "rating_denominator" has 23 values different from 10, which is the default denominator. #OK
Retweet-removal requirement: keep only original tweets.
5) The "df_enhanced_copy" dataframe has NaN for "retweeted_status_id" #OK
6) The "df_enhanced_copy" dataframe has NaN for "retweeted_status_user_id" #OK
7) The "df_enhanced_copy" dataframe has NaN for "retweeted_status_timestamp" #OK
8) The "df_enhanced_copy" dataframe has NaN for "in_reply_to_status_id" #OK
9) The "df_enhanced_copy" dataframe has NaN for "in_reply_to_user_id" #OK

Tidiness:

1) In the "df_enhanced_copy" dataframe the timestamp column includes date and time in the same column; it would be good to separate the date from the time #OK
2) 4 variables as columns instead of 1 column: the dog-stage names from "doggo lingo" (doggo, floofer, pupper, puppo) are spread across 4 different columns instead of being grouped into a single column, which would be more helpful for the analysis. #OK
3) There should be 1 dataframe that includes all the information, and this is possible because the dataframes share tweet_id as a common variable. #OK

Cleaning:

Cleaning the 9 quality and 3 tidiness problems found before

First, for this step we will use the copies made before of the original dataframes.

Quality.1

In the "df_enhanced_copy" dataframe, "timestamp" has object as its datatype; it should be datetime.

Describe.1

Change the datatype by using the to_datetime method from pandas
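A minimal sketch of the conversion, using a toy frame standing in for df_enhanced_copy (the timestamp strings mimic the archive's format):

```python
import pandas as pd

# Toy stand-in for df_enhanced_copy with the archive's timestamp format.
df_enhanced_copy = pd.DataFrame(
    {"timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-30 15:58:51 +0000"]}
)

# Convert the object (string) column to a proper datetime dtype.
df_enhanced_copy["timestamp"] = pd.to_datetime(df_enhanced_copy["timestamp"])
```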

Test.1

Checking that the datatype change was made successfully:

Quality.2

"tweet_id" in the 3 dataframes has an int64 datatype, but it should be a string or object datatype.

Describe.2

To change the datatype we use the .astype method for a quick conversion to the string datatype.
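A sketch on a toy frame (in the project this cast is applied to all 3 dataframes):

```python
import pandas as pd

# Toy frame with int64 ids, as in the original archive.
df = pd.DataFrame({"tweet_id": [892420643555336193, 892177421306343426]})

# Cast the ids to strings so they are treated as labels, not numbers.
df["tweet_id"] = df["tweet_id"].astype(str)
```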

Test.2

Verifying the datatype change using .info()

Quality.4:

The column "rating_denominator" has 23 values different from 10, which is the default denominator.

Describe:

With .loc we will select only the rows that have a value different from 10 as the rating denominator.
Then, we will replace that selection with 10 as the rating denominator.
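A sketch of the .loc selection and replacement on a toy frame (the real step runs on df_enhanced_copy):

```python
import pandas as pd

df = pd.DataFrame({"rating_numerator": [12, 84, 13],
                   "rating_denominator": [10, 70, 10]})

# Select the rows whose denominator differs from 10 and overwrite
# them with the default denominator.
df.loc[df["rating_denominator"] != 10, "rating_denominator"] = 10
```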

Test.4:

Quality 5, 6, 7, 8 and 9:

5) The "df_enhanced_copy" dataframe has NaN for "retweeted_status_id"
6) The "df_enhanced_copy" dataframe has NaN for "retweeted_status_user_id"
7) The "df_enhanced_copy" dataframe has NaN for "retweeted_status_timestamp"
8) The "df_enhanced_copy" dataframe has NaN for "in_reply_to_status_id"
9) The "df_enhanced_copy" dataframe has NaN for "in_reply_to_user_id"

Describe:

First, we will count the NaN in the columns named above to see whether they are candidates to be dropped from the dataframe.
Then, we calculate the number of rows in the "df_enhanced_copy" dataframe to compute the proportion of NaN in each column.

Because the percentage of NaN in all 5 columns ("retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp", "in_reply_to_status_id" and "in_reply_to_user_id") is above 90%, we will eliminate those columns from the dataframe "df_enhanced_copy".
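The count and proportion of NaN can be computed as follows; the toy frame here is mostly NaN, like the real columns:

```python
import numpy as np
import pandas as pd

cols = ["retweeted_status_id", "retweeted_status_user_id",
        "retweeted_status_timestamp", "in_reply_to_status_id",
        "in_reply_to_user_id"]

# Toy stand-in: 3 NaN out of 4 rows in every column.
df = pd.DataFrame({c: [np.nan, np.nan, np.nan, 1.0] for c in cols})

nan_count = df[cols].isnull().sum()  # NaN per column
nan_share = nan_count / len(df)      # proportion of NaN per column
```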

However, before removing those columns from the dataframe, it is necessary to remove the retweets, because we want to keep only the original tweets:

Retweet requirement Cleaning:

We want to remove the rows that have non-null values in the columns retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp.
We use .isnull() to select only the rows where those columns are null, and we keep that selection as the "df_enhanced_copy" dataframe, excluding all the tweets that are retweets.
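A sketch of the retweet filter on a toy frame (one retweet column shown for brevity):

```python
import numpy as np
import pandas as pd

df_enhanced_copy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    # Row 2 is a retweet: it has a non-null retweeted_status_id.
    "retweeted_status_id": [np.nan, 8.86054e17, np.nan],
})

# Keep only the rows where retweeted_status_id is null, i.e. original tweets.
df_enhanced_copy = df_enhanced_copy[
    df_enhanced_copy["retweeted_status_id"].isnull()
]
```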

Retweet requirement Test:

We use .info() to verify whether any non-null values remain in the columns used above.

Now that we have removed the retweets from the dataframe, we will drop the following columns due to their high percentage of NaN values.
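Dropping the five mostly-NaN columns can be sketched as:

```python
import numpy as np
import pandas as pd

to_drop = ["retweeted_status_id", "retweeted_status_user_id",
           "retweeted_status_timestamp", "in_reply_to_status_id",
           "in_reply_to_user_id"]

# One-row toy frame with the columns to be removed.
df = pd.DataFrame({"tweet_id": ["1"], **{c: [np.nan] for c in to_drop}})

# Remove the columns that are almost entirely NaN.
df = df.drop(columns=to_drop)
```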

Testing:

Verifying, by listing the column names, that the columns eliminated above are no longer in the dataframe:

Tidiness.1

In the "df_enhanced_copy" dataframe the timestamp column includes date and time in the same column; it would be good to separate the date from the time

Describe

First, we have to extract the hour, minute and second from the timestamp. After creating those columns, we concatenate the full time into a single column called "time_timestamp".

Then, we extract the date from the timestamp column and create a new column with only the date.

Finally, to avoid duplicating information, the auxiliary columns (hour_timestamp, minute_timestamp and second_timestamp) and the timestamp column are removed from the dataframe.
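The split can be sketched with the .dt accessor; this version goes straight to date and time columns instead of building the hour/minute/second helpers first:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2017-08-01 16:23:56", "2017-07-30 15:58:51"])})

# Separate the combined timestamp into a date column and a time column.
df["date_timestamp"] = df["timestamp"].dt.date
df["time_timestamp"] = df["timestamp"].dt.time

# Drop the original column so the information is not duplicated.
df = df.drop(columns=["timestamp"])
```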

Test

To verify that the auxiliary columns were eliminated and the new columns exist in the "df_enhanced_copy" dataframe, we list all the column names

Tidiness.2

4 variables as columns instead of 1 column in the "df_enhanced_copy" dataframe:
The dog-stage names from "doggo lingo" (doggo, floofer, pupper, puppo) are spread across 4 different columns instead of being grouped into a single column, which would be more helpful for the analysis.

Describe

For this purpose, we will create a function "w" that identifies the rows in the "df_enhanced_copy" dataframe that do not have a doggo lingo stage, in other words, that have a "None" value.

Then, with pd.melt() we will turn the variables "puppo", "floofer", "doggo" and "pupper" into only one column named "doggo_lingo"

Finally, we will eliminate all the duplicates created by the method mentioned above
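A compact sketch of the melt step on toy data (without the helper function "w"; rows whose stage is "None" are simply filtered out here):

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]
df = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "doggo":   ["doggo", "None"],
    "floofer": ["None", "None"],
    "pupper":  ["None", "pupper"],
    "puppo":   ["None", "None"],
})

# Collapse the 4 stage columns into one "doggo_lingo" column.
melted = df.melt(id_vars=["tweet_id"], value_vars=stages,
                 value_name="doggo_lingo")

# Keep the rows that actually name a stage and drop the helper column.
melted = melted[melted["doggo_lingo"] != "None"].drop(columns=["variable"])
```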

Testing:

Tidiness.3:

There should be 1 dataframe that includes all the information, and this is possible because the dataframes share tweet_id as a common column.
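A sketch of the merge on toy frames joined by the shared tweet_id key (all columns other than tweet_id are illustrative stand-ins):

```python
import pandas as pd

df_enhanced = pd.DataFrame({"tweet_id": ["1", "2"], "rating_numerator": [12, 13]})
df_image    = pd.DataFrame({"tweet_id": ["1", "2"], "jpg_url": ["a.jpg", "b.jpg"]})
df_json     = pd.DataFrame({"tweet_id": ["1", "2"], "retweet_count": [100, 200]})

# Join the 3 frames on the common tweet_id column.
df_master = (df_enhanced
             .merge(df_image, on="tweet_id")
             .merge(df_json, on="tweet_id"))
```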

Test:

Verify that all 3 dataframes are combined into a single dataframe with their corresponding columns:

Quality.3

"jpg_url" does not have an image for every tweet_id when the 3 dataframes are merged into 1

Describe.3

Since we want to keep only the tweet_ids that have an image, all the rows without a jpg_url will be removed from the dataframe.

With notnull() it is possible to select only the rows that have a jpg_url.
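A sketch of the notnull() filter on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                   "jpg_url": ["a.jpg", np.nan, "c.jpg"]})

# Keep only the rows that have an image url.
df = df[df["jpg_url"].notnull()]
```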

Test.3:

Test if the null rows from jpg_url were actually removed from the dataframe

Quality.10:

Remove the 2 rows that have null values for "retweet_count" and "favorite_count"

Describe.10:

Because the rows with null values for retweet_count and favorite_count are the same, dropping the nulls in just one of the columns removes them from the dataframe.

For this, we will use the same method used before: notnull()
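A sketch on a toy frame where the nulls coincide, as in the merged dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "retweet_count":  [100.0, np.nan, 300.0],
    "favorite_count": [400.0, np.nan, 600.0],  # null on the same rows
})

# Filtering on one column is enough because the nulls coincide.
df = df[df["retweet_count"].notnull()]
```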

Test.10:

Test if the null rows from the columns retweet_count and favorite_count were actually removed from the dataframe

Store and export the clean dataframe to a csv file called "twitter_archive_master.csv" to avoid re-cleaning the dataframes in later analysis:
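A sketch of the export and reload, with a one-row toy frame in place of the real master dataframe:

```python
import pandas as pd

df_master = pd.DataFrame({"tweet_id": ["1"], "rating_numerator": [12]})

# index=False keeps the row index out of the file.
df_master.to_csv("twitter_archive_master.csv", index=False)

# Later analysis can start directly from the stored file; reading tweet_id
# as str preserves the datatype fixed during cleaning.
df_check = pd.read_csv("twitter_archive_master.csv", dtype={"tweet_id": str})
```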

Visualization and Analysis

Now that we have the clean dataframe, we can create visualizations to support a better data analysis.

Topics analyzed:

Top 10 most common dog names:

Top 10 least common dog names:

Top 5 tweets with the highest rating:

Image with the highest rating:

Because p1 is the algorithm's #1 prediction for the image in the tweet, we will look at the breeds that appear most often in the predictions:

Top 5 breeds predicted:

Most tweeted doggo lingo:

*Quality.11:

Transform the datatype of retweet_count and favorite_count from float to int

*Describe.11:

Using .astype(int) to transform the datatype for the columns retweet_count and favorite_count
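A sketch of the cast on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"retweet_count":  [100.0, 200.0],
                   "favorite_count": [400.0, 500.0]})

# The counts are whole numbers, so cast the float columns to int.
cols = ["retweet_count", "favorite_count"]
df[cols] = df[cols].astype(int)
```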

*Test.11:

See if the dtype changed from float to int for retweet_count and favorite_count

Correlation matrix:
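The matrix comes from DataFrame.corr(); a sketch on synthetic data where favorites track retweets but the rating is independent (the numbers here are illustrative, not the project's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
retweets = rng.integers(0, 5000, size=50)

df = pd.DataFrame({
    "retweet_count":  retweets,
    # favorites roughly proportional to retweets, plus noise
    "favorite_count": retweets * 3 + rng.integers(0, 500, size=50),
    # rating drawn independently of both counts
    "rating":         rng.uniform(0.5, 1.4, size=50),
})

corr = df.corr()  # pairwise Pearson correlations
```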

Important insights:

-From the correlation plot above we can see that favorite_count and retweet_count have a high correlation of 0.789513

-On the other hand, contrary to expectations, the rating column (rating_numerator/rating_denominator) has little correlation with the number of retweets or favorites a tweet gets, which is demonstrated by a correlation of 0.012748 for retweet_count and 0.010148 for favorite_count.

The most favorited and 4th most retweeted dog:

Conclusion:

Cleaning the dataframe is not a one-phase stage: as I got into the analysis and visualization of the dataframe, I could still find quality problems that could have been solved earlier but were not as visible in the Clean stage as they were when I was truly interacting with the data.

On the other hand, data wrangling is an important step toward quality data analysis and good decisions based on a well-profiled dataframe. However, it requires time and attention to notice problems. In addition, gathering the data will not always be easy, depending on where we get the data from, mostly because of missing values, different datatypes, or other problems.

Specifically, from the data collected from WeRateDogs we could analyze the most and least common dog names, the dogs with the highest ratings, the most tweeted dog stage, and the most favorited and retweeted dog in the account.

References used in this work: